
feat: async Python client for non-blocking sandbox operations #510

Open
AndyJCai wants to merge 5 commits into kubernetes-sigs:main from AndyJCai:feat/async-python-client

Conversation

@AndyJCai

@AndyJCai AndyJCai commented Apr 2, 2026

Summary

Adds AsyncSandboxClient for non-blocking sandbox operations using httpx and kubernetes_asyncio. Supersedes #256.

Credit: Builds on @raceychan's #256 — the core idea, dependency choices, and API shape originate from that PR. This version rebases onto the current modular k8s_agent_sandbox structure and incorporates @SHRUTI6991's review feedback (deduplication, tests).

pip install k8s-agent-sandbox[async]

import asyncio

from k8s_agent_sandbox import AsyncSandboxClient
from k8s_agent_sandbox.models import SandboxDirectConnectionConfig

config = SandboxDirectConnectionConfig(
    api_url="http://sandbox-router-svc.default.svc.cluster.local:8080"
)


async def main() -> None:
    # __aexit__ awaits delete_all(), so sandboxes are cleaned up on exit.
    async with AsyncSandboxClient(connection_config=config) as client:
        sandbox = await client.create_sandbox("python-sandbox-template")
        result = await sandbox.commands.run("echo hello")
        print(result.stdout)


asyncio.run(main())

Design decisions

| Decision | Why |
| --- | --- |
| connection_config required | Sync defaults to LocalTunnel (kubectl port-forward), which is inherently synchronous. Fail fast instead of silently pointing at localhost. |
| except BaseException in create cleanup | asyncio.CancelledError is a BaseException. Without this, cancelled tasks leak orphaned K8s claims. |
| asyncio.Lock on shared state | Dict mutations interleave at await points. Guards check-then-act on _active_connection_sandboxes. |
| try/finally on watch streams | Ensures w.close() runs on exceptions/cancellation, preventing leaked sessions. |
| Manual HTTP status retry loop | httpx.AsyncHTTPTransport(retries=N) only retries connection errors, not 500/502/503/504. Manual loop matches sync client behavior. |
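
To illustrate the cancellation row above, here is a minimal sketch using only the standard library. It is not the PR's implementation: `create_sandbox`, `deleted_claims`, and the claim name are hypothetical stand-ins for the real create flow and the K8s claim deletion. It shows that cleanup placed under `except BaseException` runs when the task is cancelled, and that re-raising keeps the cancellation visible to the caller.

```python
import asyncio

deleted_claims: list[str] = []  # hypothetical stand-in for real K8s claim deletion


async def create_sandbox(claim_name: str) -> str:
    # Hypothetical create flow: the claim exists before the sandbox is ready.
    try:
        await asyncio.sleep(10)  # placeholder for waiting on readiness
        return claim_name
    except BaseException:
        # CancelledError derives from BaseException, so `except Exception`
        # would skip this cleanup when the task is cancelled.
        deleted_claims.append(claim_name)
        raise  # re-raise so the cancellation still propagates


async def main() -> bool:
    task = asyncio.create_task(create_sandbox("claim-1"))
    await asyncio.sleep(0)  # let the task start and reach its await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return True
    return False


was_cancelled = asyncio.run(main())
```

Swapping `except BaseException` for `except Exception` in this sketch leaves `deleted_claims` empty, which is the leak the table row describes.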

AI disclosure

This PR was prepared with AI tooling (Cursor). All changes were reviewed, understood, and verified by the author.

@netlify

netlify bot commented Apr 2, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 7b011bc
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69cf155bb8ddc900089f9610

@k8s-ci-robot k8s-ci-robot requested review from igooch and janetkuo April 2, 2026 21:00
@linux-foundation-easycla

linux-foundation-easycla bot commented Apr 2, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

Welcome @AndyJCai!

It looks like this is your first PR to kubernetes-sigs/agent-sandbox 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/agent-sandbox has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 2, 2026
@k8s-ci-robot
Contributor

Hi @AndyJCai. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 2, 2026
@AndyJCai AndyJCai force-pushed the feat/async-python-client branch 2 times, most recently from 147f81f to 812d058 Compare April 2, 2026 21:08
Add AsyncSandboxClient and supporting async modules so users can
await sandbox creation, command execution, and file I/O without
blocking the event loop. This is needed for async frameworks like
FastAPI, aiohttp, and async agent orchestrators.

Key design decisions:
- Explicit retry logic matching sync client's status-code retries
- asyncio.Lock guards on shared state for coroutine safety
- BaseException handling to catch CancelledError during create
- try/finally on all K8s watch streams to prevent leaked connections
- Required connection_config (no LocalTunnel support in async)
- Lazy __getattr__ import with actionable ImportError message

Optional deps: pip install k8s-agent-sandbox[async]

Supersedes kubernetes-sigs#256.
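
The "asyncio.Lock guards on shared state" decision in the commit message above can be sketched as follows. This is an assumption-laden illustration, not code from the PR: `Registry`, `get_or_connect`, and `_connect` are hypothetical names standing in for the client's handling of `_active_connection_sandboxes`.

```python
import asyncio


class Registry:
    """Hypothetical stand-in for the client's sandbox registry."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._sandboxes: dict[str, str] = {}
        self.connects = 0

    async def _connect(self, name: str) -> str:
        self.connects += 1
        await asyncio.sleep(0)  # an await point where other coroutines may run
        return f"connection-to-{name}"

    async def get_or_connect(self, name: str) -> str:
        # Hold the lock across the check and the insert so two coroutines
        # cannot both observe "missing" and connect twice.
        async with self._lock:
            if name not in self._sandboxes:
                self._sandboxes[name] = await self._connect(name)
            return self._sandboxes[name]


async def main() -> Registry:
    registry = Registry()
    await asyncio.gather(*(registry.get_or_connect("sb-1") for _ in range(5)))
    return registry


registry = asyncio.run(main())
```

Without the lock, each of the five coroutines would pass the membership check before the first one finished connecting, so `_connect` would run five times instead of once.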
@AndyJCai AndyJCai force-pushed the feat/async-python-client branch from 812d058 to 0156958 Compare April 2, 2026 21:16
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 2, 2026
@AndyJCai
Author

AndyJCai commented Apr 2, 2026

@barney-s Mostly inspired by #256, which went stale after a rebase was needed. This PR rebases that work onto the current k8s_agent_sandbox module structure and addresses the review feedback (code deduplication, tests). Would appreciate your review when you get a chance.

- Clear _active_connection_sandboxes after closing connections to
  prevent stale references
- Set httpx.AsyncClient timeout to 60s matching sync client behavior
  (httpx defaults to 5s)

Made-with: Cursor
@aditya-shantanu
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 2, 2026
@aditya-shantanu
Contributor

/assign @SHRUTI6991

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AndyJCai
Once this PR has been reviewed and has the lgtm label, please ask for approval from shruti6991. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@AndyJCai
Author

AndyJCai commented Apr 3, 2026

/test presubmit-agent-sandbox-e2e-test

AndyJCai added 2 commits April 2, 2026 17:31
- Remove dead retry branch in async connector (retryable status codes
  are handled before raise_for_status, so the except handler never
  sees them)
- Narrow except BaseException to except (Exception, CancelledError)
  to avoid catching KeyboardInterrupt/SystemExit
- Replace logging.basicConfig() with logging.getLogger(__name__) so
  library code doesn't hijack the root logger
- Use built-in generics (list, dict, tuple) instead of typing imports
  since project requires Python 3.10+
- Replace __getattr__ lazy import with global try/except pattern
- Add test for close() clearing the sandbox registry

Made-with: Cursor
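
A small standard-library sketch of what the narrowed `except (Exception, CancelledError)` clause from the commit above catches and what it lets through. The `guarded` helper and the `cleaned_up` list are hypothetical and only exist for this demonstration.

```python
import asyncio

cleaned_up: list[str] = []  # records which exception types triggered cleanup


def guarded(exc: BaseException) -> None:
    try:
        raise exc
    except (Exception, asyncio.CancelledError):
        # Runs for ordinary errors and task cancellation only.
        cleaned_up.append(type(exc).__name__)
        raise


for exc in (ValueError("boom"), asyncio.CancelledError(), KeyboardInterrupt()):
    try:
        guarded(exc)
    except BaseException:
        pass  # swallowed here only to keep the demonstration running
```

After the loop, `cleaned_up` holds `ValueError` and `CancelledError` but not `KeyboardInterrupt`, which bypasses the handler entirely.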
- Make fallback AsyncSandboxClient a class so isinstance() works
- Use logger = logging.getLogger(__name__) in async_connector and
  async_k8s_helper (was still using root logger)
- Add None event guard to resolve_sandbox_name watch loop to prevent
  TypeError crash on None events
- Document no-atexit behavior in AsyncSandboxClient docstring

Made-with: Cursor
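
One way the "global try/except" import with a class fallback could look is sketched below. This is an assumption about the pattern, not the PR's code: the module name `_missing_async_extra` is deliberately nonexistent so the fallback path runs, and the error message wording is illustrative.

```python
try:
    # Hypothetical module name, chosen so that the import fails in this demo.
    from _missing_async_extra import AsyncSandboxClient
except ImportError:

    class AsyncSandboxClient:  # fallback placeholder
        """Importable stand-in used when the async extra is not installed."""

        def __init__(self, *args: object, **kwargs: object) -> None:
            raise ImportError(
                "AsyncSandboxClient requires the async extra: "
                "pip install k8s-agent-sandbox[async]"
            )


# isinstance() works against the placeholder class without raising.
is_client = isinstance(object(), AsyncSandboxClient)

try:
    AsyncSandboxClient()
    error_message = ""
except ImportError as exc:
    error_message = str(exc)
```

Because the fallback is a real class rather than a function or `None`, type checks against it stay valid, and the actionable error only surfaces when someone tries to construct the client.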
self.k8s_helper = k8s_helper

self._base_url: str | None = None
self.client = httpx.AsyncClient(timeout=httpx.Timeout(60.0))
Contributor


The PR description mentions relying on transport retries for connection errors. However, httpx.AsyncClient defaults to 0 retries. To enable transport-level retries for connection errors, you should explicitly configure the transport.
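
A configuration fragment showing what the explicit transport setup might look like. The `make_client` helper and the default of 3 retries are assumptions for illustration; the 60-second timeout mirrors the value in the diff above.

```python
import httpx


def make_client(retries: int = 3) -> httpx.AsyncClient:
    # `retries` covers only connection failures (connect errors and connect
    # timeouts); HTTP status codes such as 503 still need a separate loop.
    transport = httpx.AsyncHTTPTransport(retries=retries)
    return httpx.AsyncClient(transport=transport, timeout=httpx.Timeout(60.0))
```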

    return response
except httpx.HTTPStatusError as e:
    logger.error(f"Request to sandbox failed: {e}")
    raise SandboxRequestError(
Contributor


Because both httpx.HTTPStatusError and httpx.RequestError (like ConnectError) inherit from httpx.HTTPError, any network error will be caught here and raise SandboxRequestError immediately without retrying. If you don't use AsyncHTTPTransport(retries=N), network errors will fail on the first attempt.
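
A transport-agnostic sketch of a retry loop that covers both failure modes raised in this comment. It uses only the standard library: `NetworkError` is a stand-in for a transport failure such as `httpx.ConnectError`, and `request_with_retries`, `fake_send`, and the backoff policy are hypothetical, not taken from the PR.

```python
import asyncio
from collections.abc import Awaitable, Callable

RETRYABLE_STATUSES = {500, 502, 503, 504}


class NetworkError(Exception):
    """Stand-in for a transport failure such as httpx.ConnectError."""


async def request_with_retries(
    send: Callable[[], Awaitable[int]],
    attempts: int = 3,
    backoff: float = 0.0,
) -> int:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    last_error: Exception | None = None
    for attempt in range(1, attempts + 1):
        try:
            status = await send()
        except NetworkError as exc:
            last_error = exc  # network failure: retry instead of failing fast
        else:
            if status not in RETRYABLE_STATUSES:
                return status
            last_error = RuntimeError(f"retryable status {status}")
        if attempt < attempts:
            await asyncio.sleep(backoff * attempt)
    raise RuntimeError(f"request failed after {attempts} attempts") from last_error


outcomes = iter([NetworkError("connection refused"), 503, 200])


async def fake_send() -> int:
    # Simulated request: one network error, one 503, then success.
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


final_status = asyncio.run(request_with_retries(fake_send))
```

The point of the sketch is the ordering: network errors are caught inside the loop, so they consume an attempt rather than escaping to an outer handler on the first failure.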

w = watch.Watch()
logger.info(f"Resolving sandbox name from claim '{claim_name}'...")
try:
    async for event in w.stream(
Contributor


The timeout_seconds parameter tells the Kubernetes API server when to close the stream, but the stream can also drop prematurely due to network issues or API server timeouts. If this happens, the loop exits normally and erroneously raises a TimeoutError even if the time elapsed is short.


async def __aexit__(self, exc_type, exc_val, exc_tb) -> None:
    try:
        await self.delete_all()
Contributor


delete_all closes the connection too. Why do we have an additional close method check?


Labels

area:python-client
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)


5 participants